Abstract

Single molecule Förster resonance energy transfer (FRET) experiments are used to infer the
properties of the denatured state ensemble (DSE) of proteins. From the measured average FRET
efficiency, ⟨E⟩, the distance distribution P(R) is inferred by assuming that the DSE can be described
as a polymer. The single parameter in the appropriate polymer model (Gaussian chain, Wormlike
chain, or Self-Avoiding walk) for P(R) is determined by equating the calculated and measured ⟨E⟩.
In order to assess the accuracy of this "standard procedure", we consider the Generalized Rouse
Model (GRM), whose properties (⟨E⟩ and P(R)) can be analytically computed, and the Molecular
Transfer Model for protein L, for which accurate simulations can be carried out as a function of
guanidinium hydrochloride (GdmCl) concentration. Using the precisely computed ⟨E⟩ for the
GRM and protein L, we infer P(R) using the standard procedure. We find that the mean end-to-end
distance can be accurately inferred (less than 10% relative error) using ⟨E⟩ and polymer models
for P(R). However, the values extracted for the radius of gyration (R_g) and the persistence length
(l_p) are less accurate. For protein L, the errors in the inferred properties increase as the GdmCl
concentration increases for all polymer models. The relative error in the inferred R_g and l_p, with
respect to the exact values, can be as large as 25% at the highest GdmCl concentration. We propose
a self-consistency test, requiring measurements of ⟨E⟩ by attaching dyes to different residues in
the protein, to assess the validity of describing the DSE using the Gaussian model. Application of the
self-consistency test to the GRM shows that even for this simple model, which exhibits an
order-disorder transition, the Gaussian P(R) is inadequate. Analysis of experimental data of FRET
efficiencies with dyes at several locations for the Cold Shock protein, and simulation results for
protein L, for which accurate FRET efficiencies between various locations were computed, shows
that at high GdmCl concentrations there are significant deviations in the DSE P(R) from the
Gaussian model.
Introduction: Much of our understanding of how proteins fold comes from experiments
in which folding is initiated from an ensemble of initially unfolded molecules whose structures
are hard to characterize [1]. In many experiments, the initial structures of the denatured
state ensemble (DSE) are prepared by adding an excess amount of denaturants or by raising
the temperature above the melting temperature (T_m) of the protein [2]. Theoretical studies
have shown that folding mechanisms depend on the initial conditions, i.e. the nature of the
DSE [3]. Thus, a quantitative description of protein folding mechanisms requires a molecular
characterization of the DSE, a task that is made difficult by the structural diversity of the
ensemble of unfolded states [4, 5].

In an attempt to probe the role of initial conditions on folding, single molecule FRET
experiments are being used to infer the properties of unfolded proteins. The major advantage
of these experiments is that they can measure the FRET efficiencies of the DSE under
solution conditions where the native state is stable. The average denaturant-dependent
FRET efficiency ⟨E⟩ has been used to infer the global properties of the polypeptide chain
in the DSE as the external conditions are altered. The properties of the DSE are inferred
from ⟨E⟩ by assuming a polymer model for the DSE, from which the root mean squared
distance between two dyes attached at residues i and j along the protein sequence (R_ij =
⟨|r_i - r_j|²⟩^{1/2}), the distribution of the end-to-end distance P(R) (where R = |r_N - r_0|), the
root mean squared end-to-end distance (R_ee = ⟨R²⟩^{1/2}), the root mean squared radius of
gyration (R_g = ⟨R_g²⟩^{1/2}), and the persistence length (l_p) of the denatured protein [6-15] can
be calculated.

In FRET experiments, donor (D) and acceptor (A) dyes are attached at two locations
along the protein sequence [4, 16], and hence can only provide information about correlations
between them. The efficiency of energy transfer E between D and A is equal to
(1 + r^6/R_0^6)^{-1}, where r is the distance between the dyes, and R_0 is the dye-dependent Förster
distance [4, 16]. Because of conformational fluctuations, there is a distribution of r, P(r),
which depends on external conditions such as the temperature and denaturant concentration.
As a result, the average FRET efficiency ⟨E⟩ is given by

$$ \langle E \rangle = \int_0^\infty dr \, \frac{P(r)}{1 + r^6/R_0^6} \tag{1} $$

under most experimental conditions, due to the central limit theorem [17]. If the dyes are
attached to the ends of the chain, then P(r) = P(R). Even if ⟨E⟩ is known accurately, the
extraction of P(R) from the integral equation (Eq. 1) is fraught with numerical instabilities.
In experimental applications to biopolymers, a functional form for P(r) is assumed in order
to satisfy the equality in Eq. 1. The form of P(r) is based on a particular polymer model
which depends only on a single parameter (see Table I): the Gaussian chain (dependent on
the Kuhn length a), the Wormlike Chain (WLC; dependent on the persistence length l_p),
and the Self-Avoiding Walk (SAW; dependent on the average end-to-end distance R_ee). For
the chosen polymer model meant to represent the biopolymer of interest, the free parameter
(a, l_p, or R_ee) is determined numerically to satisfy Eq. 1. Using this method (referred to as
the "standard procedure" in this article), several researchers have estimated R_g and l_p as a
function of the external conditions for protein L [11, 14], Cold Shock Protein (CspTm) [13],
and RNase H [16]. The justification for using homopolymer models to analyze FRET data
comes from the anecdotal comparison of the R_g measured using X-ray scattering experiments
and the R_g extracted from analysis of Eq. 1 [4].
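
The standard procedure can be sketched numerically as follows. This is a minimal illustration for the Gaussian chain only, not the code used in the cited studies: the function names, the trapezoidal quadrature, and the bisection bounds are assumptions made here, and the Gaussian P(R) is parametrized directly by its root mean squared end-to-end distance (equal to a√N for N bonds). Given a target ⟨E⟩ and the Förster distance R_0, the single model parameter is adjusted until Eq. 1 is satisfied.

```python
import math


def gaussian_p(r, r_ee):
    """Normalized end-to-end distribution of a Gaussian chain with rms distance r_ee."""
    prefactor = 4.0 * math.pi * (3.0 / (2.0 * math.pi * r_ee ** 2)) ** 1.5
    return prefactor * r * r * math.exp(-1.5 * r * r / r_ee ** 2)


def mean_fret(r_ee, r0, n=2000):
    """Trapezoidal estimate of <E> = int_0^inf dr P(r) / (1 + (r / r0)**6)."""
    r_max = 6.0 * r_ee  # the Gaussian tail beyond 6 r_ee is negligible
    h = r_max / n
    total = 0.0
    for k in range(1, n):  # integrand is zero at r = 0 and negligible at r_max
        r = k * h
        total += gaussian_p(r, r_ee) / (1.0 + (r / r0) ** 6)
    return total * h


def infer_r_ee(e_target, r0, lo=1.0, hi=500.0, tol=1e-6):
    """Bisection on r_ee; <E> decreases monotonically as the chain expands."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if mean_fret(mid, r0) > e_target:
            lo = mid  # model chain is too compact, so expand it
        else:
            hi = mid
    return 0.5 * (lo + hi)


if __name__ == "__main__":
    r0 = 23.0                    # Forster distance in angstroms (illustrative value)
    e_obs = mean_fret(30.0, r0)  # synthetic "measurement" from a known chain
    r_ee = infer_r_ee(e_obs, r0)
    # Inferred R_ee and the Gaussian-chain relation R_g = R_ee / sqrt(6)
    print(r_ee, r_ee / math.sqrt(6.0))
```

Bisection is used here because ⟨E⟩ is a monotonic function of the single parameter; the same loop would apply to the WLC or SAW once their P(R) from Table I is substituted for `gaussian_p`.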

Here, we study an analytically solvable generalized Rouse model (GRM) [18] and the
Molecular Transfer Model (MTM) for protein L [19] to assess the accuracy of using polymer
models to solve Eq. 1. In the GRM, two monomers that are not covalently linked interact
through a harmonic potential that is truncated at a distance c. The presence of the additional
length scale, c, which reflects the interaction between non-bonded beads, results in the
formation of an ordered state as the temperature (T) is varied. A more detailed discussion of
these models can be found in the Methods section. For the GRM, P(R) can be analytically
calculated, and hence the reliability of the standard procedure to solve Eq. 1 can be
unambiguously established. We find that the accuracy of the polymer models in extracting the
exact values in the GRM depends on the location of the monomers that are constrained by
the harmonic interaction. Using coarse-grained simulations of protein L, we show that the
error between the exact quantity and that inferred using the standard procedure depends
on the property of interest. For example, the inferred end-to-end distribution P(R) is in
qualitative, but not quantitative, agreement with the exact P(R) obtained from
accurate simulations. In general, the DSE of protein L is better characterized by the SAW
polymer model than the Gaussian chain model.

We propose that the accuracy of the popular Gaussian model can be assessed by measuring
⟨E⟩ with dyes attached at multiple sites in a protein [13, 20, 21]. If the DSE can
be described by a Gaussian chain, then the parameters extracted by attaching the dyes
at positions i and j can be used to predict ⟨E⟩ for dyes at other points. The proposed
self-consistency test shows that the Gaussian model only qualitatively accounts for the
experimental data of CspTm, simulation results for protein L, and the exact analysis of the
GRM.

Results and Discussion 

We present the results in three sections. In the first and second sections we examine
the accuracy of the standard procedure (described in the Introduction) in
inferring the properties of the denatured state of the GRM and protein L models. The third
section presents results of the Gaussian self-consistency test applied to these models. We
also analyze experimental data for CspTm to assess the extent to which the DSE deviates
from a Gaussian chain.

I. GRM: The Generalized Rouse model (GRM) is a simple modification of the Gaussian
chain with N bonds and Kuhn length a_0, which includes a single, non-covalent bond
between two monomers at positions s_1 and s_2 (Fig. 1). The monomers at s_1 and s_2 interact
through a truncated harmonic potential with spring constant k and strength κ = kc²/2, where
c is the distance at which the interaction vanishes (Eq. 4). The GRM minimally represents
a two state system, with a clear demarcation between ordered (with |r(s_2) - r(s_1)| < c)
and disordered (with |r(s_2) - r(s_1)| > c) states. Unlike other polymer models (see Table I),
which are characterized by a single length scale, the GRM is described by a_0 and the energy
scale κ. For βκ → 0 (the high temperature limit, where β = 1/k_BT), the simple Gaussian
chain is recovered (see Methods for details). By varying βκ, a disorder → order transition
can be induced (see Fig. 1). The presence of the interaction between monomers s_1 and
s_2 approximately mimics persistence of structure in the DSE of proteins. If the fraction
of ordered states, f_O, exceeds 0.5 (Fig. 1 inset), we assume that the residual structure is
present with high probability. The exact analysis of the GRM when |r(s_2) - r(s_1)| < c
allows us to examine the effect of structure in the DSE on the global properties of unfolded
states.

Because ⟨E⟩ can be calculated exactly for the GRM (see Eq. 5), it can be used to
quantitatively study the accuracy of solving Eq. 1 using the standard procedure [6, 10, 11, 13, 14].
Given the best fit for the Gaussian chain (Kuhn length a), WLC (persistence length l_p),
and SAW (average end-to-end distance R_ee), as described in Table I, many quantities of
interest can be inferred (P(R) or R_g, for example), and compared with the exact results
for the GRM. The extent to which the exact and inferred properties deviate, due to the
additional single energy scale in the GRM, is an indication of the accuracy of the standard
procedure used to analyze Eq. 1.

P(R) is accurately inferred using the Gaussian polymer model: If the interacting
monomers are located near the endpoints of the chain, the end-to-end distribution
function is bimodal, with a clear distinction between the ordered and disordered regions
[18]. However, if the monomers s_1 and s_2 are in the interior of the chain, the two-state
behavior is obscured because the distribution function becomes unimodal. In Fig. 1, we
show the exact and inferred P(R) functions for a chain with N = 63, a_0 = 3.8 Å, c = 2a_0,
and |s_2 - s_1| = (N - 1)/2 = 31. We take the Förster distance (Eq. 1) R_0 = 23 Å < ⟨R²⟩^{1/2}
for the GRM. The distributions are unimodal for both weakly (βκ = 2) and strongly
(βκ = 6.6) interacting monomers.

The strength of the interaction is most clearly captured by the fraction of conformations
in the ordered state, f_O, with f_O = 0.25 for the weakly interacting chain and
f_O = 0.75 for the strongly interacting chain (inset of Fig. 1). The inferred Gaussian
distribution functions are in excellent agreement with the exact result. Because of the
underlying Gaussian Hamiltonian in the GRM, the rather poor agreement in the inferred
SAW distribution seen in Fig. 1 is to be expected. We also note that the GRM is inherently
flexible, so that the WLC and Gaussian chains produce virtually identical distributions.

The accuracy of the inferred R_g depends on the location of the interaction:
The two-state nature of the GRM is obscured by the relatively long unstructured
regions of the chain, similar to the effect seen in laser optical tweezer experiments with
flexible handles [18]. As a result, P(R) is well represented by a Gaussian chain, with a
smaller inferred Kuhn length, a < a_0 (Fig. 2). For large βκ, where the ordered state is
predominantly occupied and r(s_2) ≈ r(s_1), the end-to-end distribution function is well
approximated by a Gaussian chain with N* = N - Δs bonds. Consequently, the single
length scale for the Gaussian chain decreases to a ≈ a_0 √(1 - Δs/N) ≈ 0.71a_0 for large
values of βκ (Fig. 2).
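
The limiting value of the inferred Kuhn length can be checked with a short estimate. It assumes, as in the paragraph above, that in the strongly ordered limit the Δs bonds between the interacting monomers form a closed loop and contribute nothing to the end-to-end vector, and that the fitted Gaussian chain keeps all N bonds:

```latex
\langle R^2 \rangle \approx (N - \Delta s)\, a_0^2 \equiv N a^2
\quad\Longrightarrow\quad
a \approx a_0 \sqrt{1 - \frac{\Delta s}{N}}
  = a_0 \sqrt{\frac{32}{63}} \approx 0.71\, a_0
\qquad (N = 63,\ \Delta s = 31).
```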

Because the two-state nature of the chain is obscured for certain values of |s_2 - s_1|, the
Gaussian chain gives an excellent approximation to the end-to-end distribution function.
However, the radius of gyration R_g is not as accurately obtained using the Gaussian chain
model, as shown in Fig. 3. The exact R_g for the GRM reflects both the length scale a_0 and
the energy scale βκ, which cannot be fully described by the single inferred length scale a
in the Gaussian chain. For the GRM, R_g depends not only on the separation between the
monomers Δs, but also explicitly on s_1 (i.e. where the interaction is along the chain; see
Fig. 3 and the Methods section), which cannot be captured by the Gaussian chain. If the
interacting monomers are in the middle of the chain (s_1 = 16 and Δs = 31),
the inferred R_g is in excellent agreement with the exact result (Fig. 3). The relative error in
R_g (the difference between the inferred and exact values, divided by the exact value) is no
less than -2%. However, for interactions near the endpoint of the chain, with s_1 = 0 and the
same Δs = 31, the relative error between the inferred and exact values of R_g is ≈ -14%.
The large errors arise because the radius of gyration depends on the behavior of all of the
monomers, so that the energy scale βκ plays a much larger role in the determination of R_g
than of R_ee.
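
The dependence of the R_g error on the position of the interaction can be illustrated with a rough estimate. The sketch below is not the calculation behind Fig. 3; it rests on assumptions made here: a continuum Gaussian chain, the fully ordered limit ⟨x²⟩ = 0 for the constrained pair, a Gaussian fit matched on ⟨R_ee²⟩ rather than on ⟨E⟩, and a geometric factor obtained by weighting each monomer pair by the squared fraction of the loop lying between them. All function names are hypothetical.

```python
def rg2_exact(n, s1, ds, x2, a0=1.0):
    """Continuum-chain <R_g^2> with one constrained pair (assumed form, see text above)."""
    delta = x2 - ds * a0 ** 2  # change in <x^2> caused by the interaction
    # Geometric weight of the constrained segment; depends on s1 as well as ds
    geom = ds / (3.0 * n) + s1 / n - (ds / (2.0 * n) + s1 / n) ** 2
    return n * a0 ** 2 / 6.0 + delta * geom


def rg2_gaussian_fit(n, ds, x2, a0=1.0):
    """<R_g^2> of the Gaussian chain that reproduces the same <R_ee^2>."""
    r_ee2 = n * a0 ** 2 + (x2 - ds * a0 ** 2)
    return r_ee2 / 6.0


def relative_error_rg(n, s1, ds, x2):
    """(inferred - exact) / exact for R_g, in units where a0 = 1."""
    exact = rg2_exact(n, s1, ds, x2) ** 0.5
    inferred = rg2_gaussian_fit(n, ds, x2) ** 0.5
    return (inferred - exact) / exact


if __name__ == "__main__":
    # N = 63 and ds = 31 as in the text; x2 = 0 is the fully ordered limit
    print(relative_error_rg(63, 0, 31, 0.0))   # interaction at the chain end
    print(relative_error_rg(63, 16, 31, 0.0))  # interaction in the middle
```

Under these assumptions the estimate gives roughly -14% for s_1 = 0 and well under 1% in magnitude for s_1 = 16, in line with the trend described above: a single fitted length scale cannot distinguish where along the chain the constrained segment sits.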

II. MTM for protein L: Protein L is a 64 residue protein (Fig. 4A) whose folding
has been studied by a variety of methods [11, 14, 22-24]. More recently, single molecule
FRET experiments have been used to probe changes in the DSE as the concentration of
GdmCl is increased from 0 to 7 M [11, 14]. From the measured GdmCl-dependent ⟨E⟩, the
properties of the DSE, such as R_ee, P(R), and R_g, were extracted by solving Eq. 1 and
assuming a Gaussian chain P(R) [11, 14]. To further determine the accuracy of polymer
models in the analysis of ⟨E⟩, we use simulations of protein L in the same range of the
concentration of denaturant, [C], as used in experiments [6, 9].

The average end-to-end distance is accurately inferred from FRET data:
In a previous study [19], we showed that the predictions based on MTM simulations for
protein L are in excellent agreement with experiments. From the calculated ⟨E⟩ with the
dyes at the endpoints (solid black line in Fig. 4B), which is in quantitative agreement with
experimental measurements [19], we determine the model parameter R_ee or l_p by assuming
that the exact P(R) can be approximated by the three polymer models in Table I. Comparison
of the exact value of R_ee to the inferred value R_F, obtained using the simulation results
for ⟨E⟩, shows good agreement for all three polymer models (Fig. 5A). There are deviations
between R_ee and R_F at [C] > C_m, the midpoint of the folding transition. The maximum
relative error (see inset of Fig. 5A) we observe is about 10% at the highest concentration of
GdmCl. The SAW model provides the most accurate estimate of R_ee at GdmCl concentrations
above C_m, with a relative error < 0.05, and the Gaussian model gives the least accurate
values, with a relative error < 0.10 (Fig. 5A). Due to the relevance of excluded volume
interactions in the DSE of real proteins, the better agreement using the SAW is to be expected.

Polymer models do not give quantitative agreement with the exact P(R):
The inferred distribution functions, P_F(R), obtained by the standard procedure (as
described in the Introduction) at [C] = 2 M and 6 M GdmCl differ from the exact results
(Fig. 5B). Surprisingly, the agreement between P(R) and P_F(R) is worse at higher [C]. The
range of R explored and the width of the exact distribution are less than predicted by the
polymer models. The Gaussian chain and the SAW models account only for chain entropy,
while the WLC only models the bending energy of the protein. However, in protein L
(and in other proteins) intra-molecular attractions are still present even when [C] = 6 M >
C_m. As a result, the range of R explored in the protein L simulations is expected to be
less than in these polymer models. Only at [C]/C_m ≫ 1 and/or at high T are proteins
expected to be described by Flory random coils. Our results show that although it is
possible to use models that can give a single quantity correctly (R_ee, for example), the
distribution functions are less accurate. The results in Fig. 5B show that P(R), inferred
from the polymer models, agrees only qualitatively with the exact P(R), with the SAW
model being the most accurate. While the MTM will not perfectly reproduce
all of the fine details of protein L under all situations, we expect it to produce more
realistic results than idealized polymer models, which have no specific intra-chain interactions.

Inferred R_g and l_p differ significantly from the exact values: The solution of
Eq. 1 using a Gaussian chain or WLC model yields a and l_p, from which R_g can be
analytically calculated (Table I). Figs. 6A and 6B, which compare the FRET-inferred
R_g^F and l_p^F with the corresponding values obtained using MTM simulations, show that the
relative errors are substantial. At high [C] values the inferred R_g^F deviates from R_g by nearly 25%
if the Gaussian chain model is used (Fig. 6A). The value of R_g is ≈ 26 Å at [C] = 8 M,
while R_g^F using the Gaussian chain model is ≈ 31 Å. In order to obtain reliable estimates
of R_g, an accurate calculation of the distance distribution between all the heavy atoms
in a protein is needed. Therefore, it is reasonable to expect that errors in the inferred
P(R) are propagated, leading to a poor estimate of internal distances, thus resulting in a
larger error in R_g. A similar inference can be drawn about the persistence length obtained
using polymer models (Fig. 6B). Plotting l_p^F as a function of [C] (Fig. 6B) against
l_p = R_ee²/2L shows that l_p^F is overestimated at concentrations above 1 M GdmCl, with
the error increasing as [C] increases. The error is less when the Gaussian chain model is used.

III. Gaussian Self-consistency test shows the DSE is non-Gaussian: The
extent to which the Gaussian chain accurately describes the ensemble of conformations that
are sampled at different values of the external conditions (temperature or denaturants) can
be assessed by performing a self-consistency test. A property of a Gaussian chain is that if
the root mean squared distance, R_ij, between two monomers i and j is known, then
R_kl, the distance between any other pair of monomers k and l, can be computed using

$$ R_{kl} = \sqrt{\frac{|k - l|}{|i - j|}} \, R_{ij}. \tag{2} $$

Thus, if the conformations of a protein (or a polymer) can be modeled as a Gaussian chain,
then R_ij inferred from the FRET efficiency ⟨E_ij⟩ should accurately predict R_kl and the
FRET efficiency ⟨E_kl⟩ if the dyes were to be placed at monomers k and l. We refer to this
criterion as the Gaussian self-consistency (GSC) test, and the extent to which the predicted
R_kl from Eq. 2 deviates from the exact R_kl reflects deviations from the Gaussian model
description of the DSE.
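
The GSC test can be sketched in a few lines. This is an illustrative implementation under assumptions made here, not the authors' code: the function names are hypothetical, the Gaussian distance distribution is parametrized by its root mean squared distance, and Eq. 1 is evaluated by simple quadrature and inverted by bisection.

```python
import math


def gaussian_mean_fret(r_rms, r0, n=2000):
    """<E> from Eq. 1 for a Gaussian distance distribution with rms distance r_rms."""
    h = 6.0 * r_rms / n
    pref = 4.0 * math.pi * (3.0 / (2.0 * math.pi * r_rms ** 2)) ** 1.5
    total = 0.0
    for k in range(1, n):
        r = k * h
        total += pref * r * r * math.exp(-1.5 * (r / r_rms) ** 2) / (1.0 + (r / r0) ** 6)
    return total * h


def infer_rms_distance(e_target, r0, lo=1.0, hi=500.0, tol=1e-6):
    """Invert Eq. 1 for the Gaussian model by bisection (<E> falls as the distance grows)."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if gaussian_mean_fret(mid, r0) > e_target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def gsc_predict(e_ij, sep_ij, sep_kl, r0):
    """Predict R_kl and <E_kl> from a measured <E_ij>; sep_* are sequence separations."""
    r_ij = infer_rms_distance(e_ij, r0)
    r_kl = math.sqrt(sep_kl / sep_ij) * r_ij  # Eq. 2
    return r_kl, gaussian_mean_fret(r_kl, r0)


def relative_error(predicted, exact):
    """Signed relative error, as used for the GSC comparison."""
    return (predicted - exact) / exact


if __name__ == "__main__":
    # Synthetic check: for an ideal Gaussian chain the prediction is exact
    a0, r0 = 3.8, 23.0
    e_40 = gaussian_mean_fret(a0 * math.sqrt(40), r0)
    r_20, e_20 = gsc_predict(e_40, 40, 20, r0)
    print(relative_error(e_20, gaussian_mean_fret(a0 * math.sqrt(20), r0)))
```

For data from a real protein or from the GRM, the exact ⟨E_kl⟩ would replace the synthetic reference in the last line, and a nonzero relative error then measures the departure from Gaussian statistics.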

GRM: For the GRM, with a non-bonded interaction between monomers s_1 and s_2, we
calculate ⟨E_ij⟩ using Eq. 8 with j fixed at 0 and for i = 20, 40, and 60. Using the exact
results for ⟨E_ij⟩, the values of R_ij are inferred assuming that P(r) is that of a Gaussian chain.
From the inferred R_ij, the values of ⟨E_kl⟩ and R_kl can be calculated using Eqs. 1 and 2,
respectively. We note that, since R_kl/R_ij = √(|k - l|/|i - j|) (Eq. 2) for any pair (k, l) using
the Gaussian chain model, the prediction of the Gaussian chain will be independent of the
particular choices of k and l, as long as their difference is held constant. We first apply the
GSC test to a GRM in which f_O ≈ 0.75 due to a favorable interaction between monomers
s_1 = 16 and s_2 = 47. There are discrepancies between the Gaussian-inferred
and exact R_kl distances, as well as between the inferred and exact ⟨E_kl⟩ efficiencies,
when a Gaussian model is used (Fig. 7). The relative errors in the predicted values of the
FRET efficiency and the inter-dye distances can be as large as 30-40%, depending on the
choice of i and j (see insets in Fig. 7). We note that the relative error in the end-to-end
distance is small for dyes near the endpoints (the green line in Fig. 7b), in agreement with
the results shown in Fig. 1. The errors decrease as f_O decreases, with a maximum error of
20% when f_O = 0.5, and 10% when f_O = 0.25 (data not shown). By construction, the GRM
is a Gaussian chain when f_O = 0, and therefore the relative errors will vanish at sufficiently
small βκ (data not shown). These results show that even for the GRM, with only one
non-bonded interaction in an otherwise Gaussian chain, its DSE cannot be accurately described
using a Gaussian chain model. Thus, even if the overall end-to-end distribution P(R) for the
GRM is well approximated as a Gaussian (as seen in Fig. 1), the internal R_kl monomer pair
distances can deviate from predictions of the Gaussian chain model.

Protein L: We apply the GSC test to our simulations of protein L at GdmCl concentrations
of [C] = 2.0 M (below C_m = 2.4 M) and [C] = 7.5 M (well above C_m). While our
simulations allow us to compute the DSE ⟨E_ij⟩ for all possible (i, j) pairs, we examine only
a subset of ⟨E_ij⟩ as a function of GdmCl concentration (Fig. 4B). By choosing multiple
j values for the same value of i, we can determine whether distant residues along the
backbone are close together spatially, which may offer insights into three-point correlations
in denatured states. We note that all values of ⟨E_ij⟩ in Fig. 4 are monotonically decreasing,
except for the (1,14) pair. This is due to the fact that the native state has a beta-strand
between these two residues; as the protein denatures, they come closer together, increasing
the FRET efficiency. We use these values for ⟨E_ij⟩ in the GSC test. The results are shown
in Figs. 8A and 8B. Relative errors in ⟨E_kl⟩ as large as 36% at 2.0 M GdmCl and 50%
at 7.5 M GdmCl are found, with the lowest errors generally seen for residues close to one
another along the backbone, in agreement with the results from the GRM (Fig. 7a inset).
In addition, the number of data points that underestimate ⟨E_kl⟩ increases as [C] is changed
from 7.5 M to 2.0 M for |k - l| < 20. Despite these differences, the gross features in Figs.
8A and 8B are concentration independent. Because the error does not vanish for all (k, l)
pairs (Figs. 8A and 8B), we conclude that the DSE of protein L cannot be modeled as a
Gaussian chain.

The GSC test for CspTm: In an interesting single molecule experiment, Schuler
and coworkers measured FRET efficiencies by attaching donor and acceptor dyes to
pairs of residues at five different locations of CspTm [13]. They analyzed the data by
assuming that the DSE properties can be mimicked using a Gaussian chain model. We
used the GSC test to predict ⟨E_kl⟩ for dyes separated by |k - l| along the sequence using
the experimentally measured values ⟨E_ij⟩.

The relative error in ⟨E_kl⟩ (Eq. 2) should be zero if CspTm can be accurately modeled
as a Gaussian chain. However, there are significant deviations (up to 17%) between the
predicted and experimental values (Fig. 9). The relative error is fairly insensitive to the
denaturant concentration (compare Figs. 9A and 9B). It is interesting to note that the
trends in Fig. 9 are qualitatively similar to the relative errors in the GRM at f_O > 0.
Based on these observations we conclude tentatively that whenever the DSE is ordered
to some extent (i.e., when there is persistent residual structure), we expect deviations
from a homopolymer description of the DSE of proteins. At the very least, the GSC
test should be routinely used to assess errors in the modeling of the DSE as a Gaussian chain.

Conclusions 

In order to assess the accuracy of polymer models in inferring the properties of the DSE
of proteins from measurements of FRET efficiencies, we studied two models for which
accurate calculations of all the equilibrium properties can be carried out. Introduction of
a non-bonded interaction between two monomers in a Gaussian chain (the GRM) leads
to a disorder-order transition as the temperature is lowered. The presence of 'residual
structure' in the GRM allows us to clarify its role in the use of the Gaussian chain model
to fit the accurately calculated FRET efficiency. Similarly, we have used the MTM model
for protein L to calculate precisely the denaturant-dependent ⟨E⟩, from which we extracted
the global properties of the DSE by solving Eq. 1 using the P(R) for the polymer models
in Table I. Quantitative comparison of the exact values of a number of properties of the
DSE (obtained analytically for the GRM and accurately using simulations for protein L)
and the values inferred from ⟨E⟩ has allowed us to assess the accuracy with which polymer
models can be used to analyze the experimental data. The major findings and implications
of our study are listed below.

(1) The polymer models, in conjunction with the measured ⟨E⟩, can accurately infer
values of R_ee, the average end-to-end distance. However, P(R), l_p, and R_g are not
quantitatively reproduced. For the GRM, R_g is underestimated, whereas it is overestimated for
protein L. The simulations show that the absolute value of the relative error in the inferred
R_g can be nearly 25% at elevated GdmCl concentration.

(2) We propose a simple self-consistency test to determine the ability of the Gaussian
chain model to correctly infer the properties of the DSE of a polymer. Because the Gaussian
chain depends only on a single length scale, the FRET efficiency can be predicted for varying
dye positions once ⟨E⟩ is accurately known for one set of dye positions. The GSC test shows
that neither the GRM, simulations of protein L, nor experimental data on CspTm can
be accurately modeled using the Gaussian chain. The relative errors between the exact
and predicted FRET efficiencies can be as high as 50%. For the GRM, we find that the
variation in the FRET efficiency as a function of the dye position changes abruptly if one
dye is placed near an interacting monomer. Taken together these findings suggest that it is
possible to infer the structured regions in the DSE by systematically varying the location
of the dyes. This is due to the fact that the FRET efficiency is perfectly monotonic in
the Gaussian chain model. An experiment that shows non-monotonic behavior in ⟨E_ij⟩ as
the dye positions i and j are varied is a clear signal of non-Gaussian behavior, and sharp
changes in the FRET efficiency as a function of |i - j| may indicate strongly interacting sites
(see Fig. 7a).

(3) The properties of the DSE inferred from Eq. 1 become increasingly more accurate as
[C] decreases. At first glance this finding may be surprising, especially considering that
stabilizing intra-peptide interactions are expected to be weakened at high GdmCl
concentrations [C], and therefore the protein should be more "polymer-like." The range of R values
sampled at low [C] is much smaller than at high [C]. Protein L swells as [C] is increased,
as a consequence of the increase in the solvent quality. It is possible that [C] ≈ 2.4 M might
be close to a Θ-solvent (favorable intrapeptide and solvent-peptide interactions are almost
neutralized), so that P(R) can be approximated by a polymer model. The inaccuracy of
polymer models in describing P(R) at [C] = 6 M suggests that only at much higher
concentrations does protein L behave as a random coil. In other words, T = 327.8 K and [C] = 6 M
is not an athermal (good) solvent.

(4) It is somewhat surprising that polymer models, which do not have side chains or any
preferred interactions between the beads, are qualitatively correct in characterizing the DSE
of proteins with complex intramolecular interactions. In addition, even [C] = 6 M GdmCl is
not an athermal solvent, suggesting that at lower [C] values the aqueous denaturant may be
closer to a Θ-solvent. A consequence of this observation is that, for many globular proteins,
the extent of collapse may not be significant, resulting in the nearness of the concentrations
at which collapse and folding transitions occur, as shown by Camacho and Thirumalai [25]
some time ago. We suggest that only by exploring the changes in the conformations of
polypeptide chains over a wide range of temperature and denaturant concentrations can
one link the variations of the DSE properties (compaction) and folding (acquisition of a
specific structure).

Theory and computational methods



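GRM model: In order to understand the effect of a single non-covalent interaction
between two monomers along a chain, we consider a Gaussian chain with Kuhn length a_0
and N bonds, with a harmonic attraction between monomers s_1 < s_2, which is cut off at a
distance c. The Hamiltonian for the GRM is

$$ \beta H = \frac{3}{2a_0^2} \int_0^N ds \, \dot{\mathbf{r}}^2(s) + \beta V[\mathbf{r}(s_2) - \mathbf{r}(s_1)] \tag{3} $$

$$ \beta V[\mathbf{r}] = \begin{cases} \beta k \, r^2/2, & |\mathbf{r}| \le c \\ \beta k \, c^2/2 = \beta\kappa, & |\mathbf{r}| > c \end{cases} \tag{4} $$

where k is the spring constant that constrains r(s_2) - r(s_1) to a harmonic well. The
Hamiltonian in Eq. 3 allows the exact determination of many quantities of interest. Defining
x = r(s_2) - r(s_1) and Δs = s_2 - s_1, we can determine most averages of interest for the GRM
using

$$ \langle \cdots \rangle = \frac{\int d^3\mathbf{r}_1 \, d^3\mathbf{x} \, d^3\mathbf{r}_N \, (\cdots) \, G(\mathbf{x}, \mathbf{r}_N; \Delta s, N)}{\int d^3\mathbf{r}_1 \, d^3\mathbf{x} \, d^3\mathbf{r}_N \, G(\mathbf{x}, \mathbf{r}_N; \Delta s, N)} \tag{5} $$

$$ G(\mathbf{x}, \mathbf{r}_N; \Delta s, N) = \exp\left( -\frac{3\mathbf{x}^2}{2\Delta s \, a_0^2} - \frac{3(\mathbf{r}_N - \mathbf{x})^2}{2(N - \Delta s) a_0^2} - \beta V[\mathbf{x}] \right). \tag{6} $$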

Cα-SCM protein model and GdmCl denaturation: We use the coarse-grained
Cα-side chain model (Cα-SCM) to model protein L (for details see the supporting information
in [19]). In the Cα-SCM each residue in the polypeptide chain is represented using two
interaction sites, one that is centered on the α-carbon atom and another that is located at
the center-of-mass of the side chain [26]. Langevin dynamics simulations [27] are carried out
in the underdamped limit at zero molar guanidinium chloride. Simulation details are given
in [19].

We model the denaturation of protein L by GdmCl using the molecular transfer model
(MTM) [19]. MTM combines simulations at zero molar GdmCl with experimentally
measured transfer free energies, using a reweighting method [28-30] to predict the equilibrium
properties of proteins at any GdmCl concentration of interest.
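
The reweighting idea behind the MTM can be illustrated with a deliberately simplified sketch. It assumes conformations sampled at 0 M GdmCl at a single temperature, each carrying a precomputed transfer free energy to the target concentration. The function and variable names, the sample values, and the use of plain exponential reweighting in place of the WHAM equations used in [19] are all assumptions made here for illustration.

```python
import math

K_B = 0.0019872  # Boltzmann constant in kcal/(mol K)


def reweighted_average(values, transfer_free_energies, temperature):
    """Average of a property at the target [C] from conformations sampled at 0 M.

    values[k] is the property in conformation k; transfer_free_energies[k] is the
    free energy (kcal/mol) of transferring conformation k from water to [C].
    Each conformation receives the weight exp(-dG_tr / k_B T).
    """
    beta = 1.0 / (K_B * temperature)
    g_min = min(transfer_free_energies)  # shift the exponent for numerical stability
    weights = [math.exp(-beta * (g - g_min)) for g in transfer_free_energies]
    norm = sum(weights)
    return sum(a * w for a, w in zip(values, weights)) / norm


if __name__ == "__main__":
    # Hypothetical data: a compact and an expanded conformation (R_ee in angstroms)
    r_ee = [20.0, 45.0]
    dg_tr = [-1.0, -3.0]  # denaturant favors the expanded, more exposed conformation
    print(reweighted_average(r_ee, dg_tr, 327.8))
```

Because only the weights change, a single set of 0 M trajectories can in principle be reused for every denaturant concentration, which is the practical appeal of this kind of approach.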



Analysis: 
GRM: The average squared end-to-end distance can be computed directly from Eq.
5, using $\langle R_{ee}^2 \rangle = N a_0^2 + (\langle \mathbf{x}^2 \rangle - \Delta s\, a_0^2)$. The exact expression for $\langle \mathbf{x}^2 \rangle$ is
easily determined, but somewhat lengthy, and we omit the explicit result here. Also of interest is the
end-to-end distribution function, $P(R) = \langle \delta[|\mathbf{r}_N| - R] \rangle$, which can be obtained from Eq.
5. In order to determine the probability of an interior bond being in the 'ordered' state
(i.e. the fraction of residual structures, see the inset of Fig. 1a), we compute the interior
distribution, $P_I(\mathbf{X}) = \langle \delta[\mathbf{x} - \mathbf{X}] \rangle$, so that $f_O = \int_{|\mathbf{x}|<c} d^3\mathbf{x}\, P_I(\mathbf{x})$. The radius of gyration
requires a more complicated integral than the one found in Eq. 5, but we find



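$$\langle R_g^2 \rangle = \frac{N a_0^2}{6} + \left( \langle \mathbf{x}^2 \rangle - \Delta s\, a_0^2 \right) \left[ \frac{\Delta s}{3N} + \frac{s_1}{N} - \left( \frac{\Delta s}{2N} + \frac{s_1}{N} \right)^2 \right]. \qquad (7)$$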



Note that, unlike the average end-to-end distance, the radius of gyration depends not only 
on As, but also on si. 

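The FRET efficiency for a system with dyes attached to $\mathbf{r}(j = 0) = 0$ and $\mathbf{r}(i)$,
$\langle E \rangle = \langle [1 + (|\mathbf{r}(i)|/R_0)^6]^{-1} \rangle$, is determined from Eq. 5 as

$$\langle E \rangle = \begin{cases} E^G(i), & 0 < i < s_1 \\[4pt] \dfrac{\int_0^\infty dx\, dr\, g_1(x, r; \{s_i\}) / [1 + (r/R_0)^6]}{\int_0^\infty dx\, dr\, g_1(x, r; \{s_i\})}, & s_1 < i < s_2 \\[8pt] \dfrac{\int_0^\infty dx\, dr\, g_2(x, r; \{s_i\}) / [1 + (r/R_0)^6]}{\int_0^\infty dx\, dr\, g_2(x, r; \{s_i\})}, & s_2 < i < N \end{cases} \qquad (8)$$

where $E^G(i)$ is the FRET efficiency for a Gaussian chain with $i$ bonds, and

$$g_1(x, r; \{s_i\}) = x r \sinh\left( \frac{3 (i - s_1)\, x r}{A a_0^2} \right) e^{-3 (i x^2 + \Delta s\, r^2)/2 A a_0^2}\, e^{-\beta V[x]}, \qquad (9)$$

$$g_2(x, r; \{s_i\}) = x r \sinh\left( \frac{3 x r}{(i - \Delta s)\, a_0^2} \right) e^{-3 x^2 / 2 \Delta s\, a_0^2}\, e^{-3 (x^2 + r^2)/2 (i - \Delta s) a_0^2}\, e^{-\beta V[x]}, \qquad (10)$$

$$A = (s_2 + s_1)\, i - s_1^2 - i^2. \qquad (11)$$
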

This result allows us to compute the Gaussian Self-consistency test, after a numerical 
integral over r. 
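A minimal sketch of the Gaussian self-consistency test is given below. It assumes that Eq. 1 has the usual form, in which the average efficiency is the integral of P(R)/[1 + (R/R_0)^6], and that P(R) is the Gaussian chain distribution. The Kuhn length is inferred from the efficiency of one dye pair and then used to predict the efficiency of a second pair. All function names are hypothetical, and the Förster radius is an input that is not specified here.

```python
import math

def gaussian_efficiency(n_bonds, a, r0, n=4000):
    """<E> for a Gaussian chain with n_bonds bonds of Kuhn length a (Simpson rule)."""
    ree2 = n_bonds * a * a
    r_max = 8.0 * math.sqrt(ree2)   # P(R) is negligible beyond this distance
    pref = 4.0 * math.pi * (3.0 / (2.0 * math.pi * ree2)) ** 1.5

    def integrand(r):
        p = pref * r * r * math.exp(-1.5 * r * r / ree2)
        return p / (1.0 + (r / r0) ** 6)

    h = r_max / n
    total = integrand(0.0) + integrand(r_max)
    for j in range(1, n):
        total += (4 if j % 2 else 2) * integrand(j * h)
    return total * h / 3.0

def infer_kuhn_length(e_measured, n_bonds, r0, lo=1e-3, hi=1e3):
    """Solve <E>(a) = e_measured by bisection on a logarithmic scale.

    <E> decreases as a grows, so the root is bracketed by lo and hi.
    """
    for _ in range(60):
        mid = math.sqrt(lo * hi)
        if gaussian_efficiency(n_bonds, mid, r0) > e_measured:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)

def self_consistency_error(e_ij, n_ij, e_kl, n_kl, r0):
    """Relative error of the efficiency predicted for pair kl from pair ij."""
    a = infer_kuhn_length(e_ij, n_ij, r0)
    predicted = gaussian_efficiency(n_kl, a, r0)
    return (predicted - e_kl) / e_kl
```

If the chain were truly Gaussian, the relative error returned by `self_consistency_error` would vanish for every choice of pairs; deviations from zero therefore measure the inadequacy of the Gaussian description.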

Protein L: Averages and distributions were computed using the MTM [19] which 
combines experimentally measured transfer free energies [31], converged simulations and 
the WHAM equations [28-30]. The WHAM equations use the simulation time series of the
potential energy and the property of interest at various temperatures and give a best
estimate of the averages and distributions of that property. The native state ensemble
(NSE) and DSE subpopulations were defined as having a structural RMSD (root mean
squared deviation) relative to the crystal structure, after least squares minimization, of
less than or greater than 5 Å, respectively. The exact values of
$l_p$ are computed using the average R from simulations and the relationships listed in Table I.
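The conversion from a mean squared end-to-end distance to a persistence length can be sketched as follows. For the Gaussian chain the sketch uses l_p = R_ee^2/(2L). For the worm-like chain it assumes the standard relation ⟨R^2⟩ = 2 l_p L - 2 l_p^2 [1 - exp(-L/l_p)], solved numerically; the relationships in Table I should be taken as authoritative. The function names and the contour length used in the usage note are hypothetical.

```python
import math

def wlc_ree2(lp, contour):
    """Mean squared end-to-end distance of a worm-like chain (assumed standard form)."""
    # 1 - exp(-u) is evaluated as -expm1(-u) to avoid cancellation for stiff chains
    return 2.0 * lp * contour + 2.0 * lp * lp * math.expm1(-contour / lp)

def lp_gaussian(ree2, contour):
    """Gaussian chain: l_p = N a^2 / (2 L), with N a^2 = <R_ee^2>."""
    return ree2 / (2.0 * contour)

def lp_wlc(ree2, contour):
    """Solve wlc_ree2(lp) = ree2 for lp by bisection on a logarithmic scale."""
    if not 0.0 < ree2 < contour * contour:
        raise ValueError("need 0 < <R^2> < L^2 for a worm-like chain")
    lo, hi = 1e-9 * contour, 1e6 * contour
    for _ in range(100):
        mid = math.sqrt(lo * hi)
        if wlc_ree2(mid, contour) < ree2:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

For example, with an assumed contour length of 63 x 3.8 Å, `lp_wlc(ree2, 63 * 3.8)` returns the persistence length whose worm-like chain reproduces a given `ree2`. For the same `ree2` this value always exceeds the Gaussian estimate, because the worm-like chain relation lies below 2 l_p L.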

Notation: Throughout the paper, exact values of all quantities are reported without
superscript or subscript. For the GRM, exact values are analytically obtained or
calculated by performing a one-dimensional integral numerically. For convenience, exact
results for protein L refer to converged simulations. While these simulations have residual
errors, the simplicity of the MTM has allowed us to calculate all properties of interest with
arbitrary accuracy. The use of subscript or superscript is, unless otherwise stated, reserved
for quantities that are extracted by solving Eq. 1 using the polymer models listed in Table I.
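TABLE I: Polymer models and their properties

Gaussian
    End-to-end distribution P(R)^a:  $4\pi R^2 \left( \frac{3}{2\pi N a^2} \right)^{3/2} \exp\left( -\frac{3R^2}{2Na^2} \right)$
    Radius of gyration R_g:          $a\sqrt{N/6}$
    Persistence length l_p:          $\frac{N a^2}{2L} = \frac{a}{2}$

Worm-like Chain^b
    End-to-end distribution P(R)^a:  $\frac{4\pi C_1 (R/L)^2}{L\,[1 - (R/L)^2]^{9/2}} \exp\left( -\frac{3L}{4 l_p [1 - (R/L)^2]} \right)$
    Radius of gyration R_g:          $\left[ \frac{L}{6C_2} - \frac{1}{4C_2^2} + \frac{1}{4C_2^3 L} - \frac{1 - \exp(-2C_2 L)}{8C_2^4 L^2} \right]^{1/2}$
    Persistence length l_p:          $\langle R^2 \rangle = 2 l_p L - 2 l_p^2 [1 - \exp(-L/l_p)]$ ^c

Self Avoiding Polymer^d
    End-to-end distribution P(R)^a:  $\frac{a}{R_{ee}} \left( \frac{R}{R_{ee}} \right)^{2+\theta} \exp\left[ -b \left( \frac{R}{R_{ee}} \right)^{\delta} \right]$
    Radius of gyration R_g:          N/A
    Persistence length l_p:          N/A

^a The average end-to-end distance $R_{ee} = \left( \int R^2 P(R)\, dR \right)^{1/2}$.
^b $L$ and $l_p$ are the contour length and persistence length, respectively. $C_1 = \left( \pi^{3/2} e^{-\alpha} \alpha^{-3/2} (1 + 3\alpha^{-1} + \tfrac{15}{4}\alpha^{-2}) \right)^{-1}$, where $\alpha = 3L/(4 l_p)$. $C_2 = 1/(2 l_p)$.
^c Using the simulated $\langle R^2 \rangle$, $l_p$ was solved for numerically using this equation.
^d $\theta$ and $\delta$ equal 0.3 and 2.5, respectively. The constants $a$ and $b$ are determined by solving the integrals of the zeroth and second moments, $\int P(R)\, dR = 1$ and $\int R^2 P(R)\, dR = R_{ee}^2$, resulting in values of $a = 3.67853$ and $b = 1.23152$.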
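The two constants of the self-avoiding polymer distribution can be checked with a short calculation. The sketch assumes that the distribution has the form (a/R_ee)(R/R_ee)^(2+θ) exp[-b (R/R_ee)^δ], so that both moment conditions reduce to gamma functions of the reduced variable t = R/R_ee. The function name is hypothetical.

```python
import math

def sap_constants(theta=0.3, delta=2.5):
    """Constants a and b from the zeroth and second moment conditions.

    Uses  int_0^inf t^p exp(-b t^delta) dt = Gamma((p+1)/delta) / (delta * b^((p+1)/delta))
    with p = 2 + theta (normalization) and p = 4 + theta (second moment equal to one
    in units of R_ee^2).
    """
    g0 = math.gamma((3.0 + theta) / delta)
    g2 = math.gamma((5.0 + theta) / delta)
    b = (g2 / g0) ** (delta / 2.0)                   # from equating the two moments
    a = delta * b ** ((3.0 + theta) / delta) / g0    # from the normalization
    return a, b

a, b = sap_constants()
print(a, b)  # compare with a = 3.67853 and b = 1.23152 quoted in the table footnote
```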



Figure Captions 



Figure 1: The top figure shows a schematic sketch of the GRM, with the donor and acceptor
at the endpoints, represented by the green spheres, and the interacting monomers at $s_1$
and $s_2$ represented by the red spheres. In the ordered configuration, the monomers at $s_1$
and $s_2$ are tightly bound. The bottom figure shows the exact and the inferred end-to-end
distribution functions P(r) for interior interactions ($\Delta s = 31$). The blue lines correspond to
the Gaussian chain model, light green lines to the SAW, and the symbols to the exact GRM
distribution. Dashed lines and red circles are for $\beta k = 6.6$, while solid lines and red squares
correspond to $\beta k = 2$. In the inset we show the fraction of ordered states as a function of
$\beta k$. Note that 75% of the structures are ordered at $\beta k = 6.6$, yet the inferred Gaussian P(r)
is in excellent agreement with the exact result.

Figure 2: The inferred Kuhn length $a$ as a function of $\beta k$ for the GRM. $R_{ee}$ monotonically
decreases as a function of the interaction strength, leading to the decrease in $a/a_0$. The Kuhn
length $a$ reaches its limiting value of $a \approx a_0 \sqrt{1 - \Delta s/N}$ when $f_O \approx 1$.

Figure 3: Comparison of the exact (symbols) and inferred (blue line) values of the radius
of gyration ($R_g$) as a function of $\beta k$ for $\Delta s = 31$. Shown are $R_g$'s for the GRM with $s_1 = 0$
(open symbols) and $s_1 = 16$ (filled symbols) for $N = 63$. The structures in the ordered state
are shown schematically. The $R_g$ obtained using the standard procedure is independent of
$s_1$, while the exact result is not. The inset shows the relative errors between the inferred
and exact values of $R_g$.

Figure 4: (a) A secondary structure representation of protein L in its native state. Starting
from the N-terminus, the residues are numbered 1 through 64. (b) The average FRET
efficiency between the various residue pairs in protein L versus GdmCl concentration.
The residue pair for each $\langle E_{ij} \rangle$ curve, computed using MTM simulations, is indicated by the
two numbers next to each line. For example, the numbers '1-64' beneath the black line
indicate that $i = 1$ and $j = 64$. The solid black line (lowest values of $\langle E \rangle$) is computed for
the dyes at the endpoints.

Figure 5: (a) The root mean squared end-to-end distance ($R_{ee}$) as a function of GdmCl
concentration for protein L. The average $R_{ee}$ (black circles) and the $R_{ee}$ for the sub-population
of the DSE (red squares) from simulations are shown. The values of $R_{ee}$ inferred by solving
Eq. (1) by the standard procedure using the Gaussian chain, Worm Like Chain, and Self
Avoiding polymer models are shown for comparison as the top, middle and bottom solid
lines, respectively. The inset shows the relative error between the exact values of $R_{ee}$ and the values
inferred using the FRET efficiency versus GdmCl concentration. The top, middle
and bottom lines correspond to the Gaussian chain, Worm Like Chain and Self Avoiding
Walk polymer models, respectively. (b) Simulation results for the denatured state end-to-end
distance distribution (P(R)) at 2.4 M GdmCl (solid red squares) and 6 M GdmCl (open red
squares) at T = 327.8 K. The P(R)s inferred using the Gaussian chain, Worm Like
Chain, and Self Avoiding Walk polymer models are also shown at 2.4 M GdmCl (dashed
lines) and 6 M GdmCl (solid lines). The top, middle and bottom lines correspond to the Self
Avoiding Walk, Worm Like Chain, and Gaussian chain polymer models.

Figure 6: (a) Comparison of $R_g$ from direct simulations of protein L and that obtained
by solving Eq. (1) using the Gaussian chain and Worm Like Chain polymer models. The top
line (magenta) shows the WLC fit, the bottom line (blue) shows the Gaussian fit, red squares
show the DSE $R_g$ from the simulation, and black circles show the average simulated $R_g$. The
inset shows the relative errors as a function of GdmCl concentration; top and bottom lines
correspond to the Gaussian chain and Worm Like Chain polymer models, respectively. (b)
Same as (a) except the figure is for $l_p$. Top and bottom lines correspond to the inferred
$l_p$ using the Gaussian chain and Worm Like Chain polymer models, respectively. Top and
bottom sets of squares correspond to a direct analysis of the simulations using the Worm
Like Chain and Gaussian chain polymer models, respectively.

Figure 7: Gaussian Self-consistency test using (a) the FRET efficiency and (b) the
average end-to-end distance for the GRM with $f_O = 0.75$ and interaction sites at $s_1 = 16$
and $s_2 = 47$. In both (a) and (b) the solid lines are the inferred properties and the open
symbols are the exact values. In both (a) and (b), $j = 0$ and the blue, magenta, and green
lines correspond to a dye at $i = 20$, 40, and 60, respectively. The insets show the relative
error for $\langle E_{kl} \rangle$ and $R_{kl}$. Note that the relative error would be zero if the Gaussian chain
accurately modeled the GRM.

Figure 8: The Gaussian self-consistency test applied to simulated DSE $\langle E_{ij} \rangle$ data of
protein L using the pairs listed in Fig. 4b. Shown are the relative errors at (a) 2.0
M GdmCl and (b) 7.5 M GdmCl. In both (a) and (b), solid green circles correspond to
$|i - j| = 13$, open orange squares to $|i - j| = 16$, blue squares to $|i - j| = 19$, open brown
circles to $|i - j| = 29$, cyan asterisks to $|i - j| = 30$, red diamonds to $|i - j| = 34$, solid violet
triangles to $|i - j| = 44$, open grey triangles to $|i - j| = 50$, and magenta x's to $|i - j| = 54$.
The color of each point corresponds to the color of each line in Fig. 4b, except for the 1-64
pair, which is not shown here.
Figure 9: The Gaussian Self-consistency test (GSC) using experimental data from
CspTm. One dye was placed at one endpoint, and the location of the other was varied.
We show the relative error of the predicted $\langle E \rangle$, using Eqs. 1 and 2, versus the distance between
the dyes ($|k - l|$) for [C] = 2 M (a) and 5 M (b). In both (a) and (b), triangles correspond to
$|i - j| = 33$, x's to $|i - j| = 45$, diamonds to $|i - j| = 46$, squares to $|i - j| = 57$, and circles
to $|i - j| = 65$. The trends in Figs. (7) and (8) are similar.